Spatial data validations #581
Conversation
All commits in the PR should be signed (`git commit -S ...`). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits
Please sign all commits
Pull Request Overview
This PR introduces specialized data validation functions for geospatial data types, implementing validation checks for geometry and geography columns.
- Adds `is_valid_geometry` and `is_valid_geography` functions for spatial data validation
- Implements row-level validation using Databricks-specific SQL functions
- Provides comprehensive test coverage for both validation functions
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/databricks/labs/dqx/geo/check_funcs.py` | Implements geospatial validation functions using `try_to_geometry` and `try_to_geography` |
| `tests/integration/test_row_checks_geo.py` | Integration tests validating the behavior of geometry and geography check functions |
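For context on the approach named above, here is a minimal PySpark sketch of the underlying idea: the Databricks SQL functions `try_to_geometry` and `try_to_geography` return NULL when an input cannot be parsed, so an "invalid value" condition can be built from that. This is an illustrative sketch only, not the PR's actual implementation; the function name is an assumption, and running it requires a runtime that provides the spatial SQL functions.

```python
from pyspark.sql import Column
from pyspark.sql import functions as F


def invalid_geometry_condition(column: str) -> Column:
    """Illustrative sketch: flag rows whose value cannot be parsed as a geometry.

    Relies on the Databricks SQL function `try_to_geometry`, which returns NULL
    for inputs it cannot parse.
    """
    parsed = F.expr(f"try_to_geometry({column})")  # column name interpolated for brevity
    # A non-NULL input that fails to parse is considered invalid; NULL inputs pass.
    return F.col(column).isNotNull() & parsed.isNull()
```

The same pattern applies to geography columns by swapping in `try_to_geography`.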
Force-pushed from b8c0af8 to 90cb668.
@mwojtyczka Can we do another review? I think the basics are there.
Please add documentation here: https://github.com/tdikland/dqx/blob/feat/geo/docs/dqx/docs/reference/quality_checks.mdx#row-level-checks-reference
Include descriptions of the check functions and examples that use classes and YAML.
Please add integration tests similar to `test_apply_checks_all_checks_using_classes` and `test_apply_checks_all_row_checks_as_yaml_with_streaming`.
Please add perf tests: https://github.com/tdikland/dqx/blob/feat/geo/tests/perf/test_apply_checks.py
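As a reference point for such documentation, a hedged sketch using DQX's metadata format (which mirrors the YAML layout) might look like the following. The `DQEngine` / `apply_checks_by_metadata` usage follows the documented DQX pattern, but the geo check function names and argument keys shown here are assumptions that should be aligned with the final API in this PR.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# `spark` is the session provided by the Databricks runtime.
df = spark.createDataFrame(
    [("POINT(1 1)", "LINESTRING(0 0, 1 1)")],
    "geom string, geog string",
)

# Same structure as the YAML format: criticality + check function + arguments.
# The geo function names and arguments below are illustrative assumptions.
checks = [
    {"criticality": "error", "check": {"function": "is_valid_geometry", "arguments": {"column": "geom"}}},
    {"criticality": "warn", "check": {"function": "is_valid_geography", "arguments": {"column": "geog"}}},
]

dq_engine = DQEngine(WorkspaceClient())
checked_df = dq_engine.apply_checks_by_metadata(df, checks)  # adds result columns (_errors/_warnings by default)
```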
Pull Request Overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.
LGTM
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
* Added support for running checks on multiple tables ([#566](#566)). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs, for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.
* Added New Row-level Checks: IPv6 Address Validation ([#578](#578)). DQX now includes 2 new row-level checks: validation of IPv6 addresses (`is_valid_ipv6_address` check function), and validation that an IPv6 address is within a provided CIDR block (`is_ipv6_address_in_cidr` check function).
* Added New Dataset-level Check: Schema Validation check ([#568](#568)). The `has_valid_schema` check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a StructType object, and optional arguments to specify columns to validate and strict mode.
* Added New Row-level Checks: Spatial data validations ([#581](#581)). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values; various geometry and geography types such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons; as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`. The addition of these geospatial data validation checks enhances the overall data quality capabilities, allowing for more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless or a cluster with runtime 17.1 or above.
* Added absolute and relative tolerance to comparison of datasets ([#574](#574)). The `compare_datasets` check has been enhanced with the introduction of absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
* Added detailed telemetry ([#561](#561)). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help us focus development efforts on the areas that matter most to our users.
* Allow installation in a custom folder ([#575](#575)). The installation process for the library has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on a [group assigned cluster](https://docs.databricks.com/aws/en/compute/group-access).
* Profile subset dataframe ([#589](#589)). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option, which is a string SQL expression that can be used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules, and it can be used to focus on particular segments of the data, such as rows that match certain conditions.
* Added custom exceptions ([#582](#582)). The codebase now utilizes custom exceptions to handle various error scenarios, providing more specific and informative error messages compared to generic exceptions.

BREAKING CHANGES!

* Workflows run by default for all run configs from the configuration file. Previously, the default behaviour was to run them for a specific run config only.
* The following deprecated methods are removed from the `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_table`, `load_checks_from_installation`, `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_table`, `save_checks_in_installation`, `load_run_config`. For loading and saving checks, users are advised to use `load_checks` and `save_checks` of the `DQEngine` described [here](https://databrickslabs.github.io/dqx/docs/guide/quality_checks_storage/), which support various storage types.
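As an illustration of how the `has_valid_schema` dataset-level check described in the notes above could be declared, here is a hedged sketch in the same metadata format used earlier; the argument keys (`expected_schema`, `strict`) are assumptions and may differ from the released API.

```python
# Hedged sketch of declaring the schema validation check via check metadata.
# Argument names below (`expected_schema`, `strict`) are assumptions, not the
# confirmed API; consult the DQX reference docs for the exact keys.
schema_check = {
    "criticality": "error",
    "check": {
        "function": "has_valid_schema",
        "arguments": {
            "expected_schema": "id INT, name STRING, geom STRING",  # DDL string form
            "strict": False,  # non-strict: expected columns must exist with compatible types
        },
    },
}
```

Such a check would then be passed to `DQEngine.apply_checks_by_metadata` alongside other checks, as in the geospatial example earlier in this thread.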
Changes
Add specialised data validations for geospatial data.
The following checks are implemented: `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`.
Linked issues
Resolves #453
Tests